# Day 3: Understanding Tokenization and Embeddings
LLMs cannot read text directly. Tokenization is the first step, converting text into sequences of integer IDs; embedding then maps those IDs to meaningful vectors.
## Tokenization Algorithm Comparison
| Algorithm | Used By | Characteristics |
|---|---|---|
| BPE (Byte Pair Encoding) | GPT series | Repeatedly merges the most frequent byte pairs |
| WordPiece | BERT | Similar to BPE but merges based on likelihood |
| SentencePiece | T5, Llama | Language-independent, treats spaces as tokens |
| Unigram | ALBERT, XLNet (via SentencePiece) | Starts with a large vocabulary and prunes low-probability tokens |
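SentencePiece's "treats spaces as tokens" behavior can be made concrete: the library replaces spaces with a visible boundary symbol (U+2581, "▁") before segmenting, so tokenization is fully reversible. Below is a minimal sketch of just that preprocessing step — `mark_spaces` is an illustrative name, not the real SentencePiece API:

```python
def mark_spaces(text: str) -> str:
    """Sketch of SentencePiece-style preprocessing: spaces become an
    explicit '\u2581' boundary symbol, so no information is lost and
    detokenization is a pure string operation."""
    return "\u2581" + text.replace(" ", "\u2581")

marked = mark_spaces("new york city")
print(marked)    # ▁new▁york▁city

# Detokenization is just the inverse replacement:
restored = marked.replace("\u2581", " ").strip()
print(restored)  # new york city
```

Because word boundaries survive as ordinary symbols, the same model works for languages with and without whitespace — the "language-independent" property in the table.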
## Tokenization Practice with tiktoken
```python
# pip install tiktoken
import tiktoken

# Tokenizer used by GPT-4
encoder = tiktoken.encoding_for_model("gpt-4")

text_ko = "대규모 언어 모델은 자연어를 이해합니다"  # "Large language models understand natural language"
text_en = "Large language models understand natural language"

tokens_ko = encoder.encode(text_ko)
tokens_en = encoder.encode(text_en)
print(f"Korean: {len(tokens_ko)} tokens -> {tokens_ko}")
print(f"English: {len(tokens_en)} tokens -> {tokens_en}")

# Decode tokens back to text, one token at a time
for token_id in tokens_ko:
    print(f"  {token_id} -> '{encoder.decode([token_id])}'")
```
Korean requires more tokens than English to express the same meaning. This directly impacts cost and context window utilization.
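To see the cost impact concretely, here is a back-of-the-envelope sketch. The token counts and the per-million-token price below are placeholder assumptions for illustration, not real measurements or real API pricing:

```python
# Hypothetical numbers: assume the same sentence takes 30 tokens in
# Korean and 10 in English, at a placeholder price of $10 per 1M tokens.
PRICE_PER_MILLION = 10.0
tokens_korean, tokens_english = 30, 10

def cost(n_tokens: int, requests: int = 1_000_000) -> float:
    """Dollar cost of `requests` calls of `n_tokens` each."""
    return n_tokens * requests * PRICE_PER_MILLION / 1_000_000

print(f"Korean : ${cost(tokens_korean):,.2f}")    # $300.00
print(f"English: ${cost(tokens_english):,.2f}")   # $100.00
print(f"Ratio  : {tokens_korean / tokens_english:.1f}x")  # 3.0x
```

The same ratio eats into the context window: at 3x the tokens per sentence, roughly a third as much Korean text fits in a fixed-size context.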
## BPE Algorithm Implementation
```python
import re

def simple_bpe(corpus, num_merges):
    """Simplified BPE algorithm."""
    # Initial vocabulary: each word split into characters plus an
    # end-of-word marker, keyed by the space-joined symbol sequence.
    vocab = {}
    for word in corpus:
        key = " ".join(list(word) + ["</w>"])
        vocab[key] = vocab.get(key, 0) + 1

    for i in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = {}
        for word, freq in vocab.items():
            symbols = word.split()
            for j in range(len(symbols) - 1):
                pair = (symbols[j], symbols[j + 1])
                pairs[pair] = pairs.get(pair, 0) + freq
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        print(f"Merge {i+1}: '{best_pair[0]}' + '{best_pair[1]}'")

        # Merge the most frequent pair. The lookarounds ensure we only
        # match whole symbols, not substrings of longer symbols
        # (a plain str.replace would merge across symbol boundaries).
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best_pair)) + r"(?!\S)")
        replacement = "".join(best_pair)
        vocab = {pattern.sub(replacement, word): freq for word, freq in vocab.items()}
    return vocab

corpus = ["low", "lower", "newest", "widest", "low", "low"]
simple_bpe(corpus, 5)
```
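Training produces an ordered list of merge rules, and encoding a new word simply replays them. The sketch below re-learns the merges for the same corpus and applies them to the unseen word "lowest" — `learn_merges` and `bpe_encode` are illustrative names for this simplified scheme, not a standard API:

```python
import re

def learn_merges(corpus, num_merges):
    """Same merge-learning loop as simple_bpe, but returning the
    ordered list of merged pairs instead of the final vocabulary."""
    vocab = {}
    for word in corpus:
        key = " ".join(list(word) + ["</w>"])
        vocab[key] = vocab.get(key, 0) + 1
    merges = []
    for _ in range(num_merges):
        pairs = {}
        for word, freq in vocab.items():
            symbols = word.split()
            for j in range(len(symbols) - 1):
                pair = (symbols[j], symbols[j + 1])
                pairs[pair] = pairs.get(pair, 0) + freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges

def bpe_encode(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        j = 0
        while j < len(symbols) - 1:
            if symbols[j] == a and symbols[j + 1] == b:
                symbols[j:j + 2] = [a + b]
            else:
                j += 1
    return symbols

merges = learn_merges(["low", "lower", "newest", "widest", "low", "low"], 5)
print(bpe_encode("lowest", merges))  # ['low', 'est', '</w>']
```

"lowest" never appeared in the corpus, yet it decomposes into the learned subwords "low" + "est" — exactly how BPE handles out-of-vocabulary words.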
## Understanding Vector Space with Word2Vec
```python
# pip install gensim
from gensim.models import Word2Vec

# Simple training data
sentences = [
    ["king", "and", "queen", "live", "in", "palace"],
    ["queen", "rules", "the", "palace"],
    ["cat", "likes", "fish"],
    ["dog", "likes", "walks"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)

# Check word vector
print(f"'king' vector dimensions: {model.wv['king'].shape}")

# Find similar words (accuracy is low due to small training data)
similar = model.wv.most_similar("king", topn=3)
for word, score in similar:
    print(f"  {word}: {score:.3f}")
```
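The scores printed by `most_similar` are cosine similarities. Here is a minimal NumPy sketch of the measure itself — the three toy 3-dimensional "embeddings" are made up by hand for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means the same
    direction, 0.0 orthogonal (unrelated), -1.0 opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy vectors: 'king' and 'queen' point in similar
# directions, 'fish' points elsewhere.
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.8, 0.9, 0.2])
fish = np.array([0.1, 0.0, 0.9])

print(cosine_similarity(king, queen))  # close to 1: similar contexts
print(cosine_similarity(king, fish))   # much lower: unrelated words
```

Because the measure depends only on direction, not magnitude, words that appear in similar contexts end up with high similarity regardless of how often they occur.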
Tokenization is the gateway to LLMs, and embeddings are the foundation for LLMs to understand language. Tomorrow we’ll learn about the Transformer architecture that operates on top of these embeddings.
## Today’s Exercises
- Install tiktoken and tokenize 3 Korean sentences and 3 English sentences. Calculate how many times more tokens Korean uses compared to English.
- Experiment with increasing the number of merges in the BPE algorithm and observe how vocabulary size and token length change.
- Explain why the famous Word2Vec relationship “king - man + woman = queen” holds from the perspective of vector space.